SIMULTANEOUS ADAPTATION TO THE MARGIN AND TO
COMPLEXITY IN CLASSIFICATION

By Guillaume Lecué

Université Paris VI

We consider the problem of adaptation to the margin and to complexity
in binary classification. We suggest an exponential weighting
aggregation scheme. We use this aggregation procedure to construct
classifiers which adapt automatically to margin and complexity. Two
main examples are worked out in which adaptivity is achieved in
frameworks proposed by Steinwart and Scovel [Learning Theory. Lecture
Notes in Comput. Sci. 3559 (2005) 279-294. Springer, Berlin;
Ann. Statist. 35 (2007) 575-607] and Tsybakov [Ann. Statist. 32
(2004) 135-166]. Adaptive schemes, like ERM or penalized ERM,
usually involve a minimization step. This is not the case for our
procedure.

1. Introduction. Let $(\mathcal{X}, \mathcal{A})$ be a measurable space. Denote by $D_n$ a
sample $((X_i, Y_i))_{i=1,\ldots,n}$ of i.i.d. random pairs of observations where $X_i \in \mathcal{X}$
and $Y_i \in \{-1,1\}$. Denote by $\pi$ the joint distribution of $(X_i, Y_i)$ on $\mathcal{X} \times \{-1,1\}$,
and by $P^X$ the marginal distribution of $X_i$. Let $(X, Y)$ be a random
pair distributed according to $\pi$ and independent of the data, and let the
component $X$ of the pair be observed. The problem of statistical learning in
classification (pattern recognition) consists of predicting the corresponding
value $Y \in \{-1,1\}$.

A prediction rule is a measurable function $f : \mathcal{X} \to \{-1,1\}$. The
misclassification error associated with $f$ is

$R(f) = \mathbb{P}(Y \ne f(X)).$

It is well known (see, e.g., Devroye, Györfi and Lugosi [15]) that

$\min_f R(f) = R(f^*) = R^*$, where $f^*(x) = \mathrm{sign}(2\eta(x) - 1)$

and $\eta$ is the a posteriori probability defined by

$\eta(x) = \mathbb{P}(Y = 1 \mid X = x),$

for all $x \in \mathcal{X}$ [where $\mathrm{sign}(y)$ denotes the sign of $y \in \mathbb{R}$ with the convention
$\mathrm{sign}(0) = 1$]. The prediction rule $f^*$ is called the Bayes rule and $R^*$ is called
the Bayes risk. A classifier is a function, $\hat f_n = \hat f_n(X, D_n)$, measurable with
respect to $D_n$ and $X$ with values in $\{-1,1\}$, that assigns to every sample $D_n$
a prediction rule $\hat f_n(\cdot, D_n) : \mathcal{X} \to \{-1,1\}$. A key characteristic of $\hat f_n$ is the
generalization error $\mathbb{E}[R(\hat f_n)]$, where

$R(\hat f_n) = \mathbb{P}(Y \ne \hat f_n(X) \mid D_n).$

The aim of statistical learning is to construct a classifier $\hat f_n$ such that
$\mathbb{E}[R(\hat f_n)]$ is as close to $R^*$ as possible. Accuracy of a classifier $\hat f_n$ is
measured by the value $\mathbb{E}[R(\hat f_n)] - R^*$, called the excess risk of $\hat f_n$.

The classical approach due to Vapnik and Chervonenkis (see, e.g., [15])
consists of searching for a classifier that minimizes the empirical risk

(1.1)  $R_n(f) = \frac{1}{n} \sum_{i=1}^{n} 1_{(Y_i f(X_i) \le 0)},$

over all prediction rules $f$ in a source class $\mathcal{F}$, where $1_A$ denotes the
indicator of the set $A$. Minimizing the empirical risk (1.1) is computationally
intractable for many sets $\mathcal{F}$ of classifiers, because this functional is
neither convex nor continuous. Nevertheless, we might base a tractable
estimation procedure on minimization of a convex surrogate $\phi$ for the loss
(Cortes and Vapnik [13], Freund and Schapire [17], Lugosi and Vayatis [28],
Friedman, Hastie and Tibshirani [18] and Bühlmann and Yu [7]). It has
recently been shown that these classification methods often give classifiers
with small Bayes risk (Blanchard, Lugosi and Vayatis [5] and Steinwart
and Scovel [38, 39]). The main idea is that the sign of the minimizer of
$A^{(\phi)}(f) = \mathbb{E}[\phi(Y f(X))]$, the $\phi$-risk, where $\phi$ is a convex loss function and $f$ a
real-valued function, is in many cases equal to the Bayes classifier $f^*$.
Therefore, minimizing $A_n^{(\phi)}(f) = \frac{1}{n} \sum_{i=1}^{n} \phi(Y_i f(X_i))$, the empirical $\phi$-risk, and
taking $\hat f_n = \mathrm{sign}(\hat F_n)$ where $\hat F_n \in \mathrm{Arg\,min}_{f \in \mathcal{F}} A_n^{(\phi)}(f)$, leads to an
approximation for $f^*$. Here, $\mathrm{Arg\,min}_{f \in \mathcal{F}} P(f)$, for a functional $P$, denotes the set of all
$f \in \mathcal{F}$ such that $P(f) = \min_{f \in \mathcal{F}} P(f)$. Schapire, Freund, Bartlett and Lee
[36], Lugosi and Vayatis [28], Blanchard, Lugosi and Vayatis [5], Zhang [48],
Steinwart and Scovel [38, 39] and Bartlett, Jordan and McAuliffe [2] give
results on statistical properties of classifiers obtained by minimization of such
a convex risk. A wide variety of classification methods in machine learning
are based on this idea, in particular, on using the convex loss associated
with support vector machines (Cortes and Vapnik [13] and Schölkopf and
Smola [37]),

$\phi(x) = (1 - x)_+,$

called the hinge-loss, where $z_+ = \max(0, z)$ denotes the positive part of $z \in \mathbb{R}$.
Denote by

$A(f) = \mathbb{E}[(1 - Y f(X))_+]$

the hinge risk of $f : \mathcal{X} \to \mathbb{R}$ and set

(1.2)  $A^* = \inf_f A(f),$

where the infimum is taken over all measurable functions $f$. We will call $A^*$
the optimal hinge risk. One may verify that the Bayes rule $f^*$ attains the
infimum in (1.2), and Lin [27] and Zhang [48] have shown that

(1.3)  $R(f) - R^* \le A(f) - A^*,$

for all measurable functions $f$ with values in $\mathbb{R}$. Thus, minimization of
$A(f) - A^*$, the excess hinge risk, provides a reasonable alternative for minimization
of excess risk.

The difficulty of classification is closely related to the behavior of the a
posteriori probability $\eta$. Mammen and Tsybakov [31], for the problem of
discriminant analysis which is close to our classification problem, and
Tsybakov [42] have introduced an assumption on the closeness of $\eta$ to 1/2, called
the margin assumption (or low noise assumption). Under this assumption,
the risk of a minimizer of the empirical risk over some fixed class $\mathcal{F}$
converges to the minimum risk over the class with a fast rate, namely, faster
than $n^{-1/2}$. In fact, with no assumption on the joint distribution $\pi$, the
convergence rate of the excess risk is not faster than $n^{-1/2}$ (cf. Devroye, Györfi
and Lugosi [15]). However, under the margin assumption, it can be as fast
as $n^{-1}$. Minimizing a penalized empirical hinge risk, under this assumption,
also leads to fast convergence rates (Blanchard, Bousquet and Massart [4],
Steinwart and Scovel [38, 39]). Massart [32], Massart and Nédélec [34] and
Massart [33] also obtain results that can lead to fast rates in classification
using penalized empirical risk in the special case of a low noise assumption.
Audibert and Tsybakov [1] show that fast rates can be achieved for plug-in
classifiers.

In this paper we consider the problem of adaptive classification.
Mammen and Tsybakov [31] have shown that fast rates depend on both the
margin parameter $\kappa$ and complexity $\rho$ of the class of candidate sets for
$\{x \in \mathcal{X} : \eta(x) \ge 1/2\}$. Their results were nonadaptive, supposing that $\kappa$ and
$\rho$ are known. Tsybakov [42] suggested an adaptive classifier that attains fast
optimal rates, up to a logarithmic factor, without knowing $\kappa$ and $\rho$.
Tsybakov and van de Geer [43] suggest a penalized empirical risk minimization
classifier that adaptively attains, up to a logarithmic factor, the same fast
optimal rates of convergence. Tarigan and van de Geer [40] extend this
result to $\ell_1$-penalized empirical hinge risk minimization. Koltchinskii [23] uses
Rademacher averages to get a similar result without the logarithmic factor.
Related work is that of Koltchinskii [22], Koltchinskii and Panchenko [24]
and Lugosi and Wegkamp [29].

Note that the existing papers on fast rates either suggest classifiers that
can be easily implemented but are nonadaptive, or adaptive schemes that
are hard to apply in practice and/or do not achieve the minimax rates (they
pay a price for adaptivity). The aim of the present paper is to suggest and
to analyze an exponential weighting aggregation scheme which does not
require a minimization step, unlike other adaptation schemes such as ERM
(Empirical Risk Minimization) and penalized ERM, and which does not pay
a price for adaptivity. This scheme is used first to construct minimax
adaptive classifiers (cf. Theorem 3.1) and second to construct easily implemented
classifiers that are adaptive simultaneously to complexity and to the margin
parameters and which achieve the fast rates.

The paper is organized as follows. In Section 2 we prove an oracle
inequality which corresponds to the adaptation step of the procedure that we
suggest. In Section 3 we apply the oracle inequality to two types of classifiers.
The first is constructed by minimization on sieves (as in Tsybakov [42])
and gives an adaptive classifier which attains fast optimal rates without
a logarithmic factor. The second is based on support vector
machines (SVM), following Steinwart and Scovel [38, 39]; it is realized
as a computationally feasible procedure and it adaptively attains fast
rates of convergence. In particular, we suggest a method of adaptive choice
of the parameter of L1-SVM classifiers with Gaussian RBF kernels. Proofs
are given in Section 4.

2. Oracle inequalities. In this section we give an oracle inequality show- 
ing that a specifically defined convex combination of classifiers mimics the 
best classifier in a given finite set. 

Suppose that we have $M \ge 2$ different classifiers $\hat f_1, \ldots, \hat f_M$ taking values
in $\{-1, 1\}$. The problem of model selection type aggregation, as studied in
Nemirovski [35], Yang [46, 47], Catoni [11] and Tsybakov [41], consists in
construction of a new classifier $\tilde f_n$ (called aggregate) which is
approximately at least as good, with respect to the excess risk, as the best among
$\hat f_1, \ldots, \hat f_M$. In most of these papers the aggregation is based on a splitting
of the sample into two independent subsamples $D_m^1$ and $D_l^2$ of sizes $m$ and
$l$, respectively, where $m \gg l$ and $m + l = n$. The first subsample $D_m^1$ is used
to construct the classifiers $\hat f_1, \ldots, \hat f_M$ and the second subsample $D_l^2$ is used
to aggregate them, that is, to construct a new classifier that mimics in a
certain sense the behavior of the best among the classifiers $\hat f_1, \ldots, \hat f_M$.

In this section we will not consider the sample splitting and will
concentrate only on the construction of aggregates (following Nemirovski [35],
Juditsky and Nemirovski [20], Tsybakov [41], Birgé [3] and Bunea, Tsybakov
and Wegkamp [10]). Thus, the first subsample is fixed, and instead of
classifiers $\hat f_1, \ldots, \hat f_M$, we have fixed prediction rules $f_1, \ldots, f_M$. Rather than work
with a part of the initial sample, we will suppose, for notational simplicity,
that the whole sample $D_n$ of size $n$ is used for the aggregation step instead
of a subsample $D_l^2$.

Our procedure uses exponential weights. The idea of exponential weights 
is well known; see, for example, Buckland, Burnham and Augustin [8], 
Yang [47], Catoni [11], Hartigan [19] and Leung and Barron [26]. This proce- 
dure has been widely used in on-line prediction; see, for example, Vovk [45] 
and Cesa-Bianchi and Lugosi [12]. We consider the following aggregate which 
is a convex combination with exponential weights of M classifiers: 

(2.1)  $\tilde f_n = \sum_{j=1}^{M} w_j^{(n)} f_j,$

where

(2.2)  $w_j^{(n)} = \frac{\exp(\sum_{i=1}^{n} Y_i f_j(X_i))}{\sum_{k=1}^{M} \exp(\sum_{i=1}^{n} Y_i f_k(X_i))} \quad \forall j = 1, \ldots, M.$

Since $f_1, \ldots, f_M$ take their values in $\{-1, 1\}$, we have

(2.3)  $w_j^{(n)} = \frac{\exp(-n A_n(f_j))}{\sum_{k=1}^{M} \exp(-n A_n(f_k))}$

for all $j \in \{1, \ldots, M\}$, where

(2.4)  $A_n(f) = \frac{1}{n} \sum_{i=1}^{n} (1 - Y_i f(X_i))_+$

is the empirical analog of the hinge risk. Since $A_n(f_j) = 2R_n(f_j)$ for all
$j = 1, \ldots, M$, these weights can be written in terms of the empirical risks of
the $f_j$'s,

$w_j^{(n)} = \frac{\exp(-2n R_n(f_j))}{\sum_{k=1}^{M} \exp(-2n R_n(f_k))} \quad \forall j = 1, \ldots, M.$

The aggregation procedure defined by (2.1) with weights (2.3) does not
need any minimization algorithm, in contrast to the ERM procedure.
Moreover, the following proposition shows that this exponential weighting
aggregation scheme has theoretical properties similar to those of the ERM
procedure, up to the residual $(\log M)/n$. In what follows, the aggregation
procedure defined by (2.1) with exponential weights (2.3) is called the
Aggregation procedure with Exponential Weights and is denoted by AEW.
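
As an illustration, the weights (2.3) and the aggregate (2.1) can be computed in a few lines. The following Python sketch is not part of the original procedure description: the function names `hinge_risk`, `aew_weights` and `aew_predict`, as well as the toy data, are hypothetical and chosen only for this example. It assumes that the prediction rules have already been evaluated on the sample.

```python
import math


def hinge_risk(preds, labels):
    """Empirical hinge risk A_n(f) = (1/n) * sum_i (1 - Y_i f(X_i))_+ ."""
    return sum(max(0.0, 1.0 - y * p) for p, y in zip(preds, labels)) / len(labels)


def aew_weights(rule_preds, labels):
    """Exponential weights (2.3): w_j proportional to exp(-n * A_n(f_j)).

    rule_preds[j][i] is f_j(X_i) in {-1, +1}; labels[i] is Y_i in {-1, +1}.
    """
    n = len(labels)
    scores = [-n * hinge_risk(p, labels) for p in rule_preds]
    top = max(scores)  # subtract the maximum before exponentiating (numerical stability)
    unnorm = [math.exp(s - top) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def aew_predict(weights, rule_values):
    """Value of the convex aggregate (2.1) at one point, given f_1(x), ..., f_M(x)."""
    return sum(w * v for w, v in zip(weights, rule_values))


# Toy data (hypothetical): three rules evaluated on four labelled points.
labels = [1, -1, 1, 1]
rule_preds = [
    [1, -1, 1, 1],    # no mistakes
    [1, 1, 1, 1],     # one mistake
    [-1, 1, -1, -1],  # wrong everywhere
]
weights = aew_weights(rule_preds, labels)
```

No optimization routine is called: the weights are an explicit function of the empirical hinge risks, and the rule with the smallest empirical risk receives the largest weight.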

Proposition 2.1. Let $M \ge 2$ be an integer and $f_1, \ldots, f_M$ be $M$
prediction rules on $\mathcal{X}$. For any integer $n$, the AEW procedure $\tilde f_n$ satisfies

(2.5)  $A_n(\tilde f_n) \le \min_{i=1,\ldots,M} A_n(f_i) + \frac{\log M}{n}.$

Obviously, inequality (2.5) is satisfied when $\tilde f_n$ is the ERM aggregate
defined by

$\tilde f_n \in \mathrm{Arg\,min}_{f \in \{f_1, \ldots, f_M\}} R_n(f).$

It is a convex combination of the $f_j$'s with weights $w_j = 1$ for one $j \in \mathrm{Arg\,min}_j R_n(f_j)$
and 0 otherwise.
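
Inequality (2.5) is deterministic, so it can be checked numerically on arbitrary data. The following Python sketch does this on randomly generated labels and rules; the function name `check_prop_2_1`, the sample sizes and the random data are hypothetical and serve only as a sanity check, not as a proof.

```python
import math
import random


def hinge(z):
    """Hinge loss (1 - z)_+ ."""
    return max(0.0, 1.0 - z)


def check_prop_2_1(M, n, seed):
    """Return (A_n(aggregate), min_j A_n(f_j) + log(M)/n) on random {-1,+1} data."""
    rng = random.Random(seed)
    labels = [rng.choice((-1, 1)) for _ in range(n)]
    rules = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(M)]
    risks = [sum(hinge(y * p) for p, y in zip(r, labels)) / n for r in rules]
    top = max(-n * a for a in risks)
    unnorm = [math.exp(-n * a - top) for a in risks]
    weights = [u / sum(unnorm) for u in unnorm]
    # Values of the convex aggregate (2.1) at the sample points.
    agg = [sum(w * r[i] for w, r in zip(weights, rules)) for i in range(n)]
    lhs = sum(hinge(y * v) for v, y in zip(agg, labels)) / n
    rhs = min(risks) + math.log(M) / n
    return lhs, rhs


lhs, rhs = check_prop_2_1(M=5, n=30, seed=0)
print(lhs <= rhs)
```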

We will use the following assumption (cf. Mammen and Tsybakov [31] and 
Tsybakov [42]) that will allow us to get fast learning rates for the classifiers 
that we aggregate. 

Assumption (MA1) [Margin (or low noise) assumption]. The
probability distribution $\pi$ on the space $\mathcal{X} \times \{-1, 1\}$ satisfies the margin assumption
(MA1)($\kappa$) with margin parameter $1 \le \kappa < +\infty$ if there exists $c > 0$ such that

(2.6)  $\mathbb{E}\{|f(X) - f^*(X)|\} \le c\,(R(f) - R^*)^{1/\kappa},$

for all measurable functions $f$ with values in $\{-1, 1\}$.

We first give the following proposition, which holds for any choice of
convex weights and not only for the particular weights given in (2.2).

Proposition 2.2. Let Assumption (MA1)($\kappa$) hold with some $1 \le \kappa < +\infty$.
Assume that there exist two positive numbers $a \ge 1$, $b$ such that $M \ge a n^b$.
Let $w_1, \ldots, w_M$ be $M$ statistics measurable w.r.t. the sample $D_n$, such
that $w_j \ge 0$, for all $j = 1, \ldots, M$, and $\sum_{j=1}^{M} w_j = 1$ ($\pi^{\otimes n}$-a.s.). Define
$\tilde f_n = \sum_{j=1}^{M} w_j f_j$, where $f_1, \ldots, f_M$ are prediction rules. There exists a constant
$C_0 > 0$ such that

$(1 - (\log M)^{-1/4})\, \mathbb{E}[A(\tilde f_n) - A^*]$

$\le \mathbb{E}[A_n(\tilde f_n) - A_n(f^*)] + C_0\, n^{-\kappa/(2\kappa-1)} (\log M)^{7/4},$

where f* is the Bayes rule. For instance, we can take Co = 10 + ca~ 1 /^ + 
a- 1 /b eX p[(6( 8c /6) 2 ) V (((8c/3) V l)/6) 3 ]. 
As a consequence, we obtain the following oracle inequality.

Theorem 2.3. Let Assumption (MA1)($\kappa$) hold with some $1 \le \kappa < +\infty$.
Assume that there exist two positive numbers $a \ge 1$, $b$ such that $M \ge a n^b$.
Let $\tilde f_n$ satisfy (2.5), for instance, the AEW or the ERM procedure. Then $\tilde f_n$
satisfies

(2.7)  $\mathbb{E}[R(\tilde f_n) - R^*] \le \left(1 + \frac{2}{\log^{1/4}(M)}\right) \left\{ 2 \min_{j=1,\ldots,M} (R(f_j) - R^*) + C_0 \frac{\log^{7/4}(M)}{n^{\kappa/(2\kappa-1)}} \right\},$

for all integers $n \ge 1$, where $C_0 > 0$ appears in Proposition 2.2.

Remark 2.1. The factor 2 multiplying $\min_{j=1,\ldots,M}(R(f_j) - R^*)$ in (2.7)
is due to the relation between the hinge excess risk and the usual excess
risk [cf. inequality (1.3)]. The hinge-loss is better suited to our convex
aggregate, since we have the same statement without this factor, namely,

$\mathbb{E}[A(\tilde f_n) - A^*] \le \left(1 + \frac{2}{\log^{1/4}(M)}\right) \left\{ \min_{j=1,\ldots,M} (A(f_j) - A^*) + C_0 \frac{\log^{7/4}(M)}{n^{\kappa/(2\kappa-1)}} \right\}.$

Moreover, linearity of the hinge-loss on $[-1, 1]$ leads to

$\min_{j=1,\ldots,M} (A(f_j) - A^*) = \min_{f \in \mathrm{Conv}} (A(f) - A^*),$

where Conv is the convex hull of the set $\{f_j : j = 1, \ldots, M\}$. Therefore, the
excess hinge risk of $\tilde f_n$ is approximately the same as that of the best convex
combination of the $f_j$'s.

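Remark 2.2. For a convex loss function $\phi$, consider the empirical $\phi$-risk
$A_n^{(\phi)}(f)$. Our proof implies that the aggregate

$\tilde f_n^{(\phi)}(x) = \sum_{j=1}^{M} w_j^{\phi} f_j(x)$ with $w_j^{\phi} = \frac{\exp(-n A_n^{(\phi)}(f_j))}{\sum_{k=1}^{M} \exp(-n A_n^{(\phi)}(f_k))} \quad \forall j = 1, \ldots, M,$

satisfies the inequality (2.5) with $A_n^{(\phi)}$ in place of $A_n$.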

We consider next a recursive analog of the aggregate (2.1). It is close 
to the one suggested by Yang [46] for density aggregation under Kullback 
loss and by Catoni [11] and Bunea and Nobel [9] for the regression model 
with squared loss. It can be also viewed as a particular case of the mirror 
descent algorithm suggested in Juditsky, Nazin, Tsybakov and Vayatis [21]. 
We consider 

(2.8)  $\bar f_n = \frac{1}{n} \sum_{k=1}^{n} \tilde f_k = \sum_{j=1}^{M} \bar w_j f_j,$

where

(2.9)  $\bar w_j = \frac{1}{n} \sum_{k=1}^{n} w_j^{(k)} = \frac{1}{n} \sum_{k=1}^{n} \frac{\exp(-k A_k(f_j))}{\sum_{l=1}^{M} \exp(-k A_k(f_l))},$

for all $j = 1, \ldots, M$, where $A_k(f) = (1/k) \sum_{i=1}^{k} (1 - Y_i f(X_i))_+$ is the
empirical hinge risk of $f$ and $w_j^{(k)}$ is the weight defined in (2.2) for the first $k$
observations. This aggregate is especially useful for the on-line framework.
The following theorem says that it has the same theoretical properties as
the aggregate (2.1).
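
The recursive structure of (2.8) and (2.9) lends itself to a single pass over the data. The Python sketch below keeps, for each rule, the running sum of $Y_i f_j(X_i)$ and averages the resulting exponential weights over $k = 1, \ldots, n$. The function names and the toy data are hypothetical; the sketch assumes that the rules take values in $\{-1, 1\}$, so that the weights (2.2) and (2.3) coincide.

```python
import math


def exponential_weights(cumulative_margins):
    """Normalised exp(sum_i Y_i f_j(X_i)), i.e. the weights (2.2)."""
    top = max(cumulative_margins)
    unnorm = [math.exp(s - top) for s in cumulative_margins]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def recursive_weights(rule_preds, labels):
    """Averaged weights (2.9): mean over k = 1..n of the weights built on the first k points."""
    n, M = len(labels), len(rule_preds)
    running = [0.0] * M    # running[j] = sum_{i <= k} Y_i f_j(X_i)
    averaged = [0.0] * M
    for k in range(n):
        for j in range(M):
            running[j] += labels[k] * rule_preds[j][k]
        for j, w in enumerate(exponential_weights(running)):
            averaged[j] += w / n
    return averaged


# Toy data (hypothetical): three rules evaluated on four labelled points.
labels = [1, -1, 1, 1]
rule_preds = [[1, -1, 1, 1], [1, 1, 1, 1], [-1, 1, -1, -1]]
bar_w = recursive_weights(rule_preds, labels)
```

Each new observation only updates the running sums, which is why this form is convenient when data arrive sequentially.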

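Theorem 2.4. Let Assumption (MA1)($\kappa$) hold with some $1 \le \kappa < +\infty$.
Assume that there exist two positive numbers $a \ge 1$, $b$ such that $M \ge a n^b$.
Then the convex aggregate $\bar f_n$ defined by (2.8) satisfies

$\mathbb{E}[R(\bar f_n) - R^*] \le \left(1 + \frac{2}{\log^{1/4}(M)}\right) \left\{ 2 \min_{j=1,\ldots,M} (R(f_j) - R^*) + C_0\, \gamma(n, \kappa) \log^{7/4}(M) \right\},$

for all integers $n \ge 1$, where $C_0 > 0$ appears in Proposition 2.2 and $\gamma(n, \kappa)$
is equal to $((2\kappa - 1)/(\kappa - 1))\, n^{-\kappa/(2\kappa-1)}$ if $\kappa > 1$ and to $(\log n)/n$ if $\kappa = 1$.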

Remark 2.3. For all $k \in \{1, \ldots, n - 1\}$, fewer observations are used to
construct $\tilde f_k$ than to construct $\tilde f_n$; thus, intuitively, we expect that $\tilde f_n$ will
learn better than $\tilde f_k$. In view of (2.8), $\bar f_n$ is an average of aggregates whose
performances are, a priori, worse than those of $\tilde f_n$; therefore, its expected
learning properties are presumably worse than those of $\tilde f_n$. An advantage
of the aggregate $\bar f_n$ is its recursive construction, but the risk behavior of $\tilde f_n$
seems to be better than that of $\bar f_n$. In fact, it is easy to see that Theorem 2.4
is satisfied for any aggregate $\bar f_n = \sum_{k=1}^{n} w_k \tilde f_k$, where $w_k \ge 0$ and $\sum_{k=1}^{n} w_k = 1$,
with $\gamma(n, \kappa) = \sum_{k=1}^{n} w_k k^{-\kappa/(2\kappa-1)}$, and the remainder term is minimized for
$w_j = 1$ when $j = n$ and 0 elsewhere, that is, for $\bar f_n = \tilde f_n$.

Remark 2.4. In this section we have dealt only with the aggregation 
step. But the construction of classifiers has to take place prior to this step. 
This requires a split of the sample as discussed at the beginning of this sec- 
tion. The main drawback of this method is that only a part of the sample is 
used for the initial estimation. However, by using different splits of the sam- 
ple and taking the average of the aggregates associated with each of them, 
we get a more balanced classifier which does not depend on a particular 
split. Since the hinge loss is linear on [—1,1], we have the same result as in 
Theorem 2.3 and Theorem 2.4 for an average of aggregates of the form (2.1) 
and (2.8), respectively, for averaging over different splits of the sample. 
3. Adaptation to the margin and to complexity. In Steinwart and Scovel
[38, 39] and Tsybakov [42] two concepts of complexity are used. In this
section we show that combining classifiers used by Tsybakov [42] or the
L1-SVM classifiers of Steinwart and Scovel [38, 39] with our aggregation method
leads to classifiers that are adaptive both to the margin parameter and to
the complexity in the two cases. Results are established for the first method
of aggregation defined in (2.1), but they are also valid for the recursive
aggregate defined in (2.8).

We use a sample splitting to construct our aggregate. The first subsample
$D_m^1 = ((X_1, Y_1), \ldots, (X_m, Y_m))$, where $m = n - l$ and $l = \lceil a n / \log n \rceil$ for
a constant $a > 0$, is used to construct classifiers and the second
subsample $D_l^2 = ((X_{m+1}, Y_{m+1}), \ldots, (X_n, Y_n))$ is used to aggregate
them by the procedure (2.1).

3.1. Adaptation in the framework of Tsybakov. Here we take $\mathcal{X} = \mathbb{R}^d$.
Introduce the following pseudo-distance, and its empirical analogue, between
the sets $G, G' \subseteq \mathcal{X}$:

$d_\Delta(G, G') = P^X(G \Delta G'), \qquad d_{\Delta,e}(G, G') = \frac{1}{n} \sum_{i=1}^{n} 1_{(X_i \in G \Delta G')},$

where $G \Delta G'$ is the symmetric difference between the sets $G$ and $G'$. If $\mathcal{Y}$ is
a class of subsets of $\mathcal{X}$, denote by $\mathcal{H}_B(\mathcal{Y}, \delta, d_\Delta)$ the $\delta$-entropy with bracketing
of $\mathcal{Y}$ for the pseudo-distance $d_\Delta$ (cf. van de Geer [44], page 16). We say that
$\mathcal{Y}$ has a complexity bound $\rho > 0$ if there exists a constant $A > 0$ such that

$\mathcal{H}_B(\mathcal{Y}, \delta, d_\Delta) \le A \delta^{-\rho} \quad \forall\, 0 < \delta \le 1.$

Various examples of classes $\mathcal{Y}$ having this property can be found in
Dudley [16], Korostelev and Tsybakov [25] and Mammen and Tsybakov [30].

Let $(\mathcal{G}_\rho)_{\rho_{\min} \le \rho \le \rho_{\max}}$ be a collection of classes of subsets of $\mathcal{X}$, where $\mathcal{G}_\rho$ has
a complexity bound $\rho$, for all $\rho_{\min} \le \rho \le \rho_{\max}$. This collection corresponds
to a priori knowledge on $\pi$ that the set $G^* = \{x \in \mathcal{X} : \eta(x) > 1/2\}$ lies in one
of these classes (typically we have $\mathcal{G}_\rho \subseteq \mathcal{G}_{\rho'}$ if $\rho \le \rho'$). The aim of adaptation
to the margin and complexity is to propose $\tilde f_n$, a classifier free of $\kappa$ and
$\rho$ such that, if $\pi$ satisfies (MA1)($\kappa$) and $G^* \in \mathcal{G}_\rho$, then $\tilde f_n$ learns with the
optimal rate $n^{-\kappa/(2\kappa+\rho-1)}$ (optimality has been established in Mammen and
Tsybakov [31]), and this property holds for all values of $\kappa \ge 1$ and
$\rho_{\min} \le \rho \le \rho_{\max}$. Following Tsybakov [42], we introduce the following assumption
on the collection $(\mathcal{G}_\rho)_{\rho_{\min} \le \rho \le \rho_{\max}}$.

Assumption (A1) (Complexity assumption). Assume that $0 < \rho_{\min} < \rho_{\max} < 1$
and the $\mathcal{G}_\rho$'s are classes of subsets of $\mathcal{X}$ such that $\mathcal{G}_\rho \subseteq \mathcal{G}_{\rho'}$ for
$\rho_{\min} \le \rho < \rho' \le \rho_{\max}$ and the class $\mathcal{G}_\rho$ has complexity bound $\rho$. For any integer $n$, we
define $\rho_{n,j} = \rho_{\min} + \frac{j}{N(n)}(\rho_{\max} - \rho_{\min})$, $j = 0, \ldots, N(n)$, where $N(n)$ satisfies
$A_0' n^{b'} \le N(n) \le A_0 n^{b}$, for some finite $b \ge b' > 0$ and $A_0, A_0' > 0$. Assume that
for all $n \in \mathbb{N}$:

(i) for all $j = 0, \ldots, N(n)$, there exists $\mathcal{N}_n^j$, an $\varepsilon$-net on $\mathcal{G}_{\rho_{n,j}}$ for the
pseudo-distance $d_\Delta$ or $d_{\Delta,e}$, where $\varepsilon = a_j n^{-1/(1+\rho_{n,j})}$, $a_j > 0$ and $\max_j a_j < +\infty$;

(ii) $\mathcal{N}_n^j$ has complexity bound $\rho_{n,j}$, for $j = 0, \ldots, N(n)$.

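The first subsample $D_m^1$ is used to construct the ERM classifiers
$\hat f_m^j(x) = 2 \cdot 1_{\hat G_m^j}(x) - 1$, where $\hat G_m^j \in \mathrm{Arg\,min}_{G \in \mathcal{N}_m^j} R_m(2 \cdot 1_G - 1)$ for all $j = 0, \ldots, N(m)$,
and the second subsample $D_l^2$ is used to construct the exponential weights
of the aggregation procedure,

$w_j^{(l)} = \frac{\exp(-l A^{[l]}(\hat f_m^j))}{\sum_{k=0}^{N(m)} \exp(-l A^{[l]}(\hat f_m^k))} \quad \forall j = 0, \ldots, N(m),$

where $A^{[l]}(f) = (1/l) \sum_{i=m+1}^{n} (1 - Y_i f(X_i))_+$ is the empirical hinge risk of
$f : \mathcal{X} \to \mathbb{R}$ based on the subsample $D_l^2$. We consider

(3.1)  $\tilde f_n(x) = \sum_{j=0}^{N(m)} w_j^{(l)} \hat f_m^j(x) \quad \forall x \in \mathcal{X}.$

The construction of the $\hat f_m^j$'s does not depend on the margin parameter $\kappa$.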

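Theorem 3.1. Let $(\mathcal{G}_\rho)_{\rho_{\min} \le \rho \le \rho_{\max}}$ be a collection of classes satisfying
Assumption (A1). Then the aggregate defined in (3.1) satisfies

$\sup_{\pi \in \mathcal{P}_{\kappa,\rho}} \mathbb{E}[R(\tilde f_n) - R^*] \le C n^{-\kappa/(2\kappa+\rho-1)} \quad \forall n \ge 1,$

for all $1 \le \kappa < +\infty$ and all $\rho \in [\rho_{\min}, \rho_{\max}]$, where $C > 0$ is a constant
depending only on $a$, $b$, $b'$, $A$, $A_0$, $A_0'$, $\rho_{\min}$, $\rho_{\max}$ and $\kappa$, and $\mathcal{P}_{\kappa,\rho}$ is the set of all
probability measures $\pi$ on $\mathcal{X} \times \{-1, 1\}$ such that Assumption (MA1)($\kappa$) is
satisfied and $G^* \in \mathcal{G}_\rho$.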

3.2. Adaptation in the framework of Steinwart and Scovel. 

3.2.1. The case of a continuous kernel. Steinwart and Scovel [38] have
obtained fast learning rates for SVM classifiers depending on three
parameters, the margin parameter $0 < \alpha < +\infty$, the complexity exponent $0 < p \le 2$
and the approximation exponent $0 < \beta \le 1$. The margin assumption was first
introduced in Mammen and Tsybakov [31] for the problem of discriminant
analysis and in Tsybakov [42] for the classification problem, in the following
way:

Assumption (MA2) [Margin (or low noise) assumption]. The
probability distribution $\pi$ on the space $\mathcal{X} \times \{-1, 1\}$ satisfies the margin assumption
(MA2)($\alpha$) with margin parameter $0 < \alpha < +\infty$ if there exists $c_0 > 0$ such
that

(3.2)  $\mathbb{P}(|2\eta(X) - 1| \le t) \le c_0 t^{\alpha} \quad \forall t > 0.$

As shown in Boucheron, Bousquet and Lugosi [6], the margin Assumptions
(MA1)($\kappa$) and (MA2)($\alpha$) are equivalent with $\kappa = \frac{1+\alpha}{\alpha}$ for $\alpha > 0$.

Let $\mathcal{X}$ be a compact metric space. Let $H$ be a reproducing kernel Hilbert
space (RKHS) over $\mathcal{X}$ (see, e.g., Cristianini and Shawe-Taylor [14] and
Schölkopf and Smola [37]) and $B_H$ its closed unit ball. Denote by
$\mathcal{N}(B_H, \varepsilon, L_2(P_n^X))$ the $\varepsilon$-covering number of $B_H$ w.r.t. the canonical distance of $L_2(P_n^X)$,
the $L_2$-space w.r.t. the empirical measure, $P_n^X$, on $X_1, \ldots, X_n$. Introduce the
following assumptions as in Steinwart and Scovel [38]:

Assumption (A2). There exist $a_0 > 0$ and $0 < p \le 2$ such that, for any
integer $n$,

$\sup_{D_n \in (\mathcal{X} \times \{-1,1\})^n} \log \mathcal{N}(B_H, \varepsilon, L_2(P_n^X)) \le a_0 \varepsilon^{-p} \quad \forall \varepsilon > 0.$

Note that the supremum is taken over all samples of size $n$ and the bound
is assumed to hold for any $n$. Every RKHS satisfies (A2) with $p = 2$ (cf. Steinwart
and Scovel [38]). We define the approximation error function of the L1-SVM
as

$a(\lambda) := \inf_{f \in H} (\lambda \|f\|_H^2 + A(f)) - A^*.$

Assumption (A3). The RKHS $H$ approximates $\pi$ with exponent
$0 \le \beta \le 1$, if there exists a constant $C_0 > 0$ such that $a(\lambda) \le C_0 \lambda^{\beta}$, $\forall \lambda > 0$.

Note that every RKHS approximates every probability measure with
exponent $\beta = 0$ and the other extremal case $\beta = 1$ is equivalent to the
fact that the Bayes classifier $f^*$ belongs to the RKHS (cf. Steinwart and
Scovel [38]). Furthermore, $\beta > 1$ only for probability measures such that
$\mathbb{P}(\eta(X) = 1/2) = 1$ (cf. Steinwart and Scovel [38]). If (A2) and (A3) hold,
the parameter $(p, \beta)$ can be considered as a complexity parameter
characterizing $\pi$ and $H$.

Let $H$ be an RKHS with a continuous kernel on $\mathcal{X}$ satisfying (A2) with
parameter $0 < p < 2$. Define the L1-SVM classifier by

(3.3)  $\hat f_n^{\lambda} = \mathrm{sign}(\hat F_n^{\lambda})$, where $\hat F_n^{\lambda} \in \mathrm{Arg\,min}_{f \in H} (\lambda \|f\|_H^2 + A_n(f));$

$\lambda > 0$ is called the regularization parameter. Assume that the probability
measure $\pi$ belongs to the set $\mathcal{Q}_{\alpha,\beta}$ of all probability measures on $\mathcal{X} \times \{-1, 1\}$
satisfying Assumption (MA2)($\alpha$) with $\alpha > 0$ and (A3) with complexity
parameter $(p, \beta)$, where $0 < \beta \le 1$. It has been shown in Steinwart and
Scovel [38] that the L1-SVM classifier $\hat f_n^{\lambda_n^{\alpha,\beta}}$, where the regularization
parameter is $\lambda_n^{\alpha,\beta} = n^{-4(\alpha+1)/((2\alpha+p\alpha+4)(1+\beta))}$, satisfies the following excess risk
bound: for any $\varepsilon > 0$, there exists $C > 0$ depending only on $\alpha$, $p$, $\beta$ and $\varepsilon$
such that

(3.4)  $\mathbb{E}[R(\hat f_n^{\lambda_n^{\alpha,\beta}}) - R^*] \le C n^{-4\beta(\alpha+1)/((2\alpha+p\alpha+4)(1+\beta)) + \varepsilon} \quad \forall n \ge 1.$

We remark that if $\beta = 1$, that is, $f^* \in H$, then the learning rate in (3.4) is (up
to an $\varepsilon$) $n^{-2(\alpha+1)/(2\alpha+p\alpha+4)}$, which is a fast rate since $2(\alpha+1)/(2\alpha+p\alpha+4) \in$
$[1/2, 1)$.
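
The exponents appearing in (3.4) are easy to tabulate. The following Python sketch evaluates the rate exponent and the regularization parameter as written above; the function names `svm_rate_exponent` and `svm_regularization` are hypothetical and introduced only for this illustration, and the $\varepsilon$ term of (3.4) is ignored.

```python
def svm_rate_exponent(alpha, p, beta):
    """Exponent r in the bound n^(-r + eps) of (3.4), with the eps term ignored."""
    return 4.0 * beta * (alpha + 1.0) / ((2.0 * alpha + p * alpha + 4.0) * (1.0 + beta))


def svm_regularization(n, alpha, p, beta):
    """Regularization parameter n^(-4(alpha+1) / ((2 alpha + p alpha + 4)(1 + beta)))."""
    return n ** (-4.0 * (alpha + 1.0) / ((2.0 * alpha + p * alpha + 4.0) * (1.0 + beta)))


# With beta = 1 the exponent reduces to 2(alpha+1)/(2 alpha + p alpha + 4).
for alpha in (0.5, 1.0, 5.0):
    for p in (0.5, 1.0, 2.0):
        print(alpha, p, round(svm_rate_exponent(alpha, p, 1.0), 4))
```

For $\beta = 1$ the printed exponents stay in $[1/2, 1)$, with the value 1/2 reached at $p = 2$, in line with the remark above.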

To construct the classifier $\hat f_n^{\lambda_n^{\alpha,\beta}}$, we need to know the parameters $\alpha$ and $\beta$,
which are not available in practice. Thus, it is important to construct a
classifier, free from these parameters, which has the same behavior as $\hat f_n^{\lambda_n^{\alpha,\beta}}$ if
the underlying distribution $\pi$ belongs to $\mathcal{Q}_{\alpha,\beta}$. Below we give such a
construction.

Since the RKHS $H$ is given, the implementation of the L1-SVM classifier
$\hat f_n^{\lambda}$ requires only knowledge of the regularization parameter $\lambda$. Thus, to
provide an easily implemented procedure, using our aggregation method, it is
natural to combine L1-SVM classifiers constructed for different values of $\lambda$
in a finite grid. We now define such a procedure.

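We consider the L1-SVM classifiers $\hat f_m^{\lambda}$ defined in (3.3) for the subsample
$D_m^1$, where $\lambda$ lies in the grid

$\mathcal{G}(l) = \{\lambda_{l,k} = l^{-\phi_{l,k}} : \phi_{l,k} = 1/2 + k \Delta^{-1},\ k = 0, \ldots, \lfloor 3\Delta/2 \rfloor\},$

where we set $\Delta = l^{b_0}$ with some $b_0 > 0$. The subsample $D_l^2$ is used to
aggregate these classifiers by the procedure (2.1), namely,

(3.5)  $\tilde f_n = \sum_{\lambda \in \mathcal{G}(l)} w_{\lambda}^{(l)} \hat f_m^{\lambda},$

where

$w_{\lambda}^{(l)} = \frac{\exp(\sum_{i=m+1}^{n} Y_i \hat f_m^{\lambda}(X_i))}{\sum_{\lambda' \in \mathcal{G}(l)} \exp(\sum_{i=m+1}^{n} Y_i \hat f_m^{\lambda'}(X_i))} = \frac{\exp(-l A^{[l]}(\hat f_m^{\lambda}))}{\sum_{\lambda' \in \mathcal{G}(l)} \exp(-l A^{[l]}(\hat f_m^{\lambda'}))}$

and $A^{[l]}(f) = (1/l) \sum_{i=m+1}^{n} (1 - Y_i f(X_i))_+$.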
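
The sample split and the grid of regularization parameters involve only elementary arithmetic. The Python sketch below builds both under two stated assumptions: the aggregation subsample has size $l = \lceil a n / \log n \rceil$, and the grid consists of the values $l^{-(1/2 + k/\Delta)}$ for $k = 0, \ldots, \lfloor 3\Delta/2 \rfloor$ with $\Delta = l^{b_0}$. The function names and the default values of `a` and `b0` are hypothetical.

```python
import math


def split_sizes(n, a=1.0):
    """Sizes (m, l) of the two subsamples, assuming l = ceil(a * n / log n) and m = n - l."""
    l = math.ceil(a * n / math.log(n))
    return n - l, l


def lambda_grid(l, b0=1.0):
    """Grid of regularization parameters l^(-(1/2 + k/Delta)), k = 0..floor(3*Delta/2)."""
    delta = l ** b0
    k_max = math.floor(3.0 * delta / 2.0)
    return [l ** (-(0.5 + k / delta)) for k in range(k_max + 1)]


m, l = split_sizes(1000)
grid = lambda_grid(l)
print(m, l, len(grid), grid[0], grid[-1])
```

The exponents of the grid run from 1/2 up to (at most) 2, which covers the range of the exponents of the regularization parameter used in (3.4). Each grid value yields one L1-SVM classifier trained on the first subsample; the exponential weights are then computed on the second subsample as in (3.5).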

Theorem 3.2. Let $H$ be an RKHS with a continuous kernel on a
compact metric space $\mathcal{X}$ satisfying (A2) with parameter $0 < p < 2$. Let $K$ be a
compact subset of $(0, +\infty) \times (0, 1]$. The classifier $\tilde f_n$, defined in (3.5),
satisfies

$\sup_{\pi \in \mathcal{Q}_{\alpha,\beta}} \mathbb{E}[R(\tilde f_n) - R^*] \le C n^{-4\beta(\alpha+1)/((2\alpha+p\alpha+4)(1+\beta)) + \varepsilon}$

for all $(\alpha, \beta) \in K$ and $\varepsilon > 0$, where $\mathcal{Q}_{\alpha,\beta}$ is the set of all probability measures
on $\mathcal{X} \times \{-1, 1\}$ satisfying (MA2)($\alpha$) and (A3) with complexity parameter
$(p, \beta)$ and $C > 0$ is a constant depending only on $\varepsilon$, $p$, $K$, $a$ and $b_0$.

3.2.2. The case of the Gaussian RBF kernel. In this subsection we apply
our aggregation procedure to L1-SVM classifiers using the Gaussian RBF
kernel. Let $\mathcal{X}$ be the closed unit ball of the space $\mathbb{R}^{d_0}$ endowed with the
Euclidean norm $\|x\| = (\sum_{i=1}^{d_0} x_i^2)^{1/2}$, $\forall x = (x_1, \ldots, x_{d_0}) \in \mathbb{R}^{d_0}$. The Gaussian
RBF kernel is defined as $K_\sigma(x, x') = \exp(-\sigma^2 \|x - x'\|^2)$ for $x, x' \in \mathcal{X}$, where
$\sigma$ is a parameter and $\sigma^{-1}$ is called the width of the Gaussian kernel. The
RKHS associated with $K_\sigma$ is denoted by $H_\sigma$.
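
For concreteness, the kernel $K_\sigma(x, x') = \exp(-\sigma^2 \|x - x'\|^2)$ can be evaluated as in the following Python sketch. The function names `gaussian_rbf` and `gram_matrix` and the sample points are hypothetical; note that in this parametrization a larger $\sigma$ means a narrower kernel.

```python
import math


def gaussian_rbf(x, x_prime, sigma):
    """Gaussian RBF kernel exp(-sigma^2 * ||x - x'||^2) for points given as sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-(sigma ** 2) * sq_dist)


def gram_matrix(points, sigma):
    """Matrix (K_sigma(X_i, X_j))_{i,j}, as needed by a quadratic programming solver."""
    return [[gaussian_rbf(p, q, sigma) for q in points] for p in points]


# Three illustrative points in the closed unit ball of R^2.
points = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]]
G = gram_matrix(points, sigma=1.5)
```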

Steinwart and Scovel [39] introduced the following assumption. 

Assumption (GNA) (Geometric noise assumption). There exist $C_1 > 0$
and $\gamma > 0$ such that

$\mathbb{E}\left[ |2\eta(X) - 1| \exp\left( -\frac{\tau(X)^2}{t} \right) \right] \le C_1 t^{\gamma d_0 / 2} \quad \forall t > 0.$

Here $\tau$ is a function on $\mathcal{X}$ with values in $\mathbb{R}$ which measures the distance
between a given point $x$ and the decision boundary, namely,

$\tau(x) = d(x, G_0 \cup G_1)$ if $x \in G_{-1}$,
$\tau(x) = d(x, G_0 \cup G_{-1})$ if $x \in G_1$,
$\tau(x) = 0$ otherwise,

for all $x \in \mathcal{X}$, where $G_0 = \{x \in \mathcal{X} : \eta(x) = 1/2\}$, $G_1 = \{x \in \mathcal{X} : \eta(x) > 1/2\}$
and $G_{-1} = \{x \in \mathcal{X} : \eta(x) < 1/2\}$. Here $d(x, A)$ denotes the Euclidean distance
from a point $x$ to the set $A$. If $\pi$ satisfies Assumption (GNA) for a $\gamma > 0$,
we say that $\pi$ has a geometric noise exponent $\gamma$.

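The L1-SVM classifier associated to the Gaussian RBF kernel with width
$\sigma^{-1}$ and regularization parameter $\lambda$ is defined by $\hat f_n^{(\sigma,\lambda)} = \mathrm{sign}(\hat F_n^{(\sigma,\lambda)})$, where
$\hat F_n^{(\sigma,\lambda)}$ is given by (3.3) with $H = H_\sigma$. Using the standard development
related to SVM (cf. Schölkopf and Smola [37]), we may write
$\hat F_n^{(\sigma,\lambda)}(x) = \sum_{i=1}^{n} \hat C_i K_\sigma(X_i, x)$, $\forall x \in \mathcal{X}$, where $\hat C_1, \ldots, \hat C_n$ are solutions of the
maximization problem

$\max_{0 \le 2\lambda C_i Y_i \le n^{-1}} \left\{ 2 \sum_{i=1}^{n} C_i Y_i - \sum_{i,j=1}^{n} C_i C_j K_\sigma(X_i, X_j) \right\},$

which can be obtained using standard quadratic programming software.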
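According to Steinwart and Scovel [39], if the probability measure $\pi$ on
$\mathcal{X} \times \{-1, 1\}$ satisfies the margin Assumption (MA2)($\alpha$) with margin
parameter $0 < \alpha < +\infty$ and Assumption (GNA) with a geometric noise
exponent $\gamma > 0$, the classifier $\hat f_n^{(\sigma_n^{\alpha,\gamma}, \lambda_n^{\alpha,\gamma})}$, where the regularization parameter and
width are defined by

$\lambda_n^{\alpha,\gamma} = n^{-(\gamma+1)/(2\gamma+1)}$ if $\gamma \le \frac{\alpha+2}{2\alpha}$,
$\lambda_n^{\alpha,\gamma} = n^{-2(\gamma+1)(\alpha+1)/(2\gamma(\alpha+2)+3\alpha+4)}$ otherwise,

and

$\sigma_n^{\alpha,\gamma} = (\lambda_n^{\alpha,\gamma})^{-1/((\gamma+1) d_0)},$

satisfies

(3.6)  $\mathbb{E}[R(\hat f_n^{(\sigma_n^{\alpha,\gamma}, \lambda_n^{\alpha,\gamma})}) - R^*] \le C n^{-\gamma/(2\gamma+1) + \varepsilon}$ if $\gamma \le \frac{\alpha+2}{2\alpha}$,
       $\mathbb{E}[R(\hat f_n^{(\sigma_n^{\alpha,\gamma}, \lambda_n^{\alpha,\gamma})}) - R^*] \le C n^{-2\gamma(\alpha+1)/(2\gamma(\alpha+2)+3\alpha+4) + \varepsilon}$ otherwise,

for all $\varepsilon > 0$, where $C > 0$ is a constant which depends only on $\alpha$, $\gamma$ and $\varepsilon$.
We remark that fast rates are obtained only for $\gamma > (3\alpha + 4)/(2\alpha)$.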
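
The two regimes of (3.6) can be explored numerically. The Python sketch below encodes the rate exponent and the pair (width parameter, regularization parameter), assuming that the regimes are separated at $\gamma = (\alpha+2)/(2\alpha)$, which is the value at which the two exponents coincide. The function names are hypothetical and the $\varepsilon$ term is ignored.

```python
def gaussian_svm_rate_exponent(alpha, gamma):
    """Exponent r in the bound n^(-r + eps) of (3.6), with the eps term ignored."""
    if gamma <= (alpha + 2.0) / (2.0 * alpha):
        return gamma / (2.0 * gamma + 1.0)
    return 2.0 * gamma * (alpha + 1.0) / (2.0 * gamma * (alpha + 2.0) + 3.0 * alpha + 4.0)


def gaussian_svm_parameters(n, alpha, gamma, d0):
    """Pair (sigma, lambda) used in (3.6), under the regime split assumed above."""
    if gamma <= (alpha + 2.0) / (2.0 * alpha):
        lam_exponent = (gamma + 1.0) / (2.0 * gamma + 1.0)
    else:
        lam_exponent = 2.0 * (gamma + 1.0) * (alpha + 1.0) / (
            2.0 * gamma * (alpha + 2.0) + 3.0 * alpha + 4.0
        )
    lam = n ** (-lam_exponent)
    sigma = lam ** (-1.0 / ((gamma + 1.0) * d0))
    return sigma, lam


# For alpha = 2 the regimes meet at gamma = 1 and the rate becomes fast beyond gamma = 2.5.
for gamma in (0.5, 1.0, 2.0, 2.5, 3.0):
    print(gamma, round(gaussian_svm_rate_exponent(2.0, gamma), 4))
```

The exponent exceeds 1/2 exactly when $2\gamma\alpha > 3\alpha + 4$, which is the condition $\gamma > (3\alpha+4)/(2\alpha)$ stated above.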

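To construct the classifier $\hat f_n^{(\sigma_n^{\alpha,\gamma}, \lambda_n^{\alpha,\gamma})}$, we need to know the parameters $\alpha$
and $\gamma$, which are not available in practice. As in Section 3.2.1, we use our
procedure to obtain a classifier which is adaptive to the margin and to
the geometric noise parameters. Our aim is to provide an easily computed
adaptive classifier. We propose the following method based on a grid for
$(\sigma, \lambda)$. We consider the finite sets

$\mathcal{M}(l) = \left\{ (\varphi_{l,p_1}, \psi_{l,p_2}) = \left( \frac{p_1}{2\Delta}, \frac{p_2}{\Delta} + \frac{1}{2} \right) : p_1 = 1, \ldots, 2\lfloor \Delta \rfloor;\ p_2 = 1, \ldots, \lfloor \Delta/2 \rfloor \right\},$

where we let $\Delta = l^{b_0}$ for some $b_0 > 0$, and

$\mathcal{N}(l) = \{ (\sigma_{l,\varphi}, \lambda_{l,\psi}) = (l^{\varphi/d_0}, l^{-\psi}) : (\varphi, \psi) \in \mathcal{M}(l) \}.$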

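We construct the family of classifiers $(\hat f_m^{(\sigma,\lambda)} : (\sigma, \lambda) \in \mathcal{N}(l))$ using the
observations of the subsample $D_m^1$ and we aggregate them by the procedure
(2.1) using $D_l^2$, namely,

(3.7)  $\tilde f_n = \sum_{(\sigma,\lambda) \in \mathcal{N}(l)} w_{\sigma,\lambda}^{(l)} \hat f_m^{(\sigma,\lambda)},$

where

(3.8)  $w_{\sigma,\lambda}^{(l)} = \frac{\exp(\sum_{i=m+1}^{n} Y_i \hat f_m^{(\sigma,\lambda)}(X_i))}{\sum_{(\sigma',\lambda') \in \mathcal{N}(l)} \exp(\sum_{i=m+1}^{n} Y_i \hat f_m^{(\sigma',\lambda')}(X_i))} \quad \forall (\sigma, \lambda) \in \mathcal{N}(l).$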
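Denote by $\mathcal{R}_{\alpha,\gamma}$ the set of all probability measures on $\mathcal{X} \times \{-1, 1\}$
satisfying both the margin Assumption (MA2)($\alpha$) with a margin parameter
$\alpha > 0$ and Assumption (GNA) with a geometric noise exponent $\gamma > 0$. Define
$\mathcal{U} = \{(\alpha, \gamma) \in (0, +\infty)^2 : \gamma > \frac{\alpha+2}{2\alpha}\}$ and $\mathcal{U}' = \{(\alpha, \gamma) \in (0, +\infty)^2 : \gamma \le \frac{\alpha+2}{2\alpha}\}$.

Theorem 3.3. Let $K$ be a compact subset of $\mathcal{U}$ and $K'$ a compact subset
of $\mathcal{U}'$. The aggregate $\tilde f_n$, defined in (3.7), satisfies

$\sup_{\pi \in \mathcal{R}_{\alpha,\gamma}} \mathbb{E}[R(\tilde f_n) - R^*] \le C n^{-\gamma/(2\gamma+1) + \varepsilon}$ if $(\alpha, \gamma) \in K'$,
$\sup_{\pi \in \mathcal{R}_{\alpha,\gamma}} \mathbb{E}[R(\tilde f_n) - R^*] \le C n^{-2\gamma(\alpha+1)/(2\gamma(\alpha+2)+3\alpha+4) + \varepsilon}$ if $(\alpha, \gamma) \in K$,

for all $(\alpha, \gamma) \in K \cup K'$ and $\varepsilon > 0$, where $C > 0$ depends only on $\varepsilon$, $K$, $K'$, $a$
and $b_0$.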

4. Proofs. 

Lemma 4.1. For all positive $v$, $t$ and all $\kappa \ge 1$, $t + v \ge v^{(2\kappa-1)/(2\kappa)} t^{1/(2\kappa)}$.

Proof. Since log is concave, we have $\log(ab) = (1/x) \log(a^x) + (1/y) \log(b^y) \le \log(a^x/x + b^y/y)$
for all positive numbers $a$, $b$ and $x$, $y$ such that
$1/x + 1/y = 1$; thus $ab \le a^x/x + b^y/y$. Lemma 4.1 follows by applying this
relation with $a = t^{1/(2\kappa)}$, $x = 2\kappa$ and $b = v^{(2\kappa-1)/(2\kappa)}$. □
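
As a quick sanity check of Lemma 4.1 (not a substitute for the proof), the inequality can be tested on random inputs. In the Python sketch below the function name `lemma_4_1_gap` and the sampling ranges are hypothetical.

```python
import random


def lemma_4_1_gap(v, t, kappa):
    """t + v minus v^((2k-1)/(2k)) * t^(1/(2k)); Lemma 4.1 states that this is nonnegative."""
    return t + v - v ** ((2.0 * kappa - 1.0) / (2.0 * kappa)) * t ** (1.0 / (2.0 * kappa))


rng = random.Random(0)
gaps = [
    lemma_4_1_gap(rng.uniform(1e-3, 10.0), rng.uniform(1e-3, 10.0), rng.uniform(1.0, 5.0))
    for _ in range(1000)
]
print(min(gaps) >= 0.0)
```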

Proof of Proposition 2.1. Observe that $(1 - x)_+ = 1 - x$ for $x \le 1$. Since $Y_i \tilde f_n(X_i) \le 1$ and $Y_i f_j(X_i) \le 1$ for all $i = 1, \ldots, n$ and $j = 1, \ldots, M$, we have $A_n(\tilde f_n) = \sum_{j=1}^{M} w_j^{(n)} A_n(f_j)$. We have $A_n(f_j) = A_n(f_{j_0}) + \frac{1}{n}\big(\log(w_{j_0}^{(n)}) - \log(w_j^{(n)})\big)$, for any $j, j_0 = 1, \ldots, M$, where the weights $w_j^{(n)}$ are defined in (2.3) by

$$w_j^{(n)} = \frac{\exp(-n A_n(f_j))}{\sum_{k=1}^{M} \exp(-n A_n(f_k))},$$

and by multiplying the last equation by $w_j^{(n)}$ and summing over $j$, we get

$$A_n(\tilde f_n) \le \min_{j=1,\ldots,M} A_n(f_j) + \frac{\log M}{n}. \tag{4.1}$$

Indeed, we have $\log(w_{j_0}^{(n)}) \le 0$, $\forall j_0 = 1, \ldots, M$, and $\sum_{j=1}^{M} w_j^{(n)} \log\big(\frac{w_j^{(n)}}{1/M}\big) = K(w|u) \ge 0$, where $K(w|u)$ denotes the Kullback-Leibler divergence between the weights $w = (w_j^{(n)})_{j=1,\ldots,M}$ and uniform weights $u = (1/M)_{j=1,\ldots,M}$. □
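The inequality (4.1) can be checked numerically on synthetic data. The sketch below is only an illustration under stated assumptions: the labels and the values $f_j(X_i)$ are drawn at random, and the function names `hinge_empirical_risk` and `exponential_weights` are hypothetical.

```python
import math
import random

def hinge_empirical_risk(f_values, y):
    """Empirical hinge risk A_n(f) = (1/n) * sum_i (1 - y_i * f(x_i))_+."""
    return sum(max(0.0, 1.0 - yi * fi) for yi, fi in zip(y, f_values)) / len(y)

def exponential_weights(risks, n):
    """Weights proportional to exp(-n * A_n(f_j)), shifted for numerical stability."""
    m = min(risks)
    unnorm = [math.exp(-n * (r - m)) for r in risks]
    total = sum(unnorm)
    return [u / total for u in unnorm]

random.seed(0)
n, M = 50, 8
y = [random.choice([-1, 1]) for _ in range(n)]
# Each prediction rule is represented by its values f_j(X_i) in {-1, 1}.
rules = [[random.choice([-1, 1]) for _ in range(n)] for _ in range(M)]

risks = [hinge_empirical_risk(f, y) for f in rules]
w = exponential_weights(risks, n)
aggregate = [sum(w[j] * rules[j][i] for j in range(M)) for i in range(n)]

lhs = hinge_empirical_risk(aggregate, y)      # A_n of the aggregate
rhs = min(risks) + math.log(M) / n            # right-hand side of (4.1)
print(lhs <= rhs + 1e-12)
```

Because the aggregate takes values in $[-1, 1]$, its empirical hinge risk equals the weighted average of the individual risks, which is the linearity used at the start of the proof.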

Proof of Proposition 2.2. Denote $\gamma = (\log M)^{-1/4}$, $u = 2\gamma n^{-\kappa/(2\kappa-1)} \log^2 M$ and $W_n = (1 - \gamma)(A(\tilde f_n) - A^*) - (A_n(\tilde f_n) - A_n(f^*))$. We have

$$\begin{aligned} \mathbb{E}[W_n] &= \mathbb{E}[W_n(1_{(W_n \le u)} + 1_{(W_n > u)})] \\ &\le u + \mathbb{E}[W_n 1_{(W_n > u)}] \\ &= u + u\,\mathbb{P}(W_n > u) + \int_u^{+\infty} \mathbb{P}(W_n > t)\, dt \\ &\le 2u + \int_u^{+\infty} \mathbb{P}(W_n > t)\, dt. \end{aligned}$$

On the other hand, $(f_j)_{j=1,\ldots,M}$ are prediction rules, so we have $A(f_j) = 2R(f_j)$ and $A_n(f_j) = 2R_n(f_j)$ (recall that $A^* = 2R^*$). Moreover, we work in the linear part of the hinge-loss; thus

$$\begin{aligned} \mathbb{P}(W_n > t) &= \mathbb{P}\Big( \sum_{j=1}^{M} w_j \big( (A(f_j) - A^*)(1 - \gamma) - (A_n(f_j) - A_n(f^*)) \big) > t \Big) \\ &\le \mathbb{P}\Big( \max_{j=1,\ldots,M} \big( (A(f_j) - A^*)(1 - \gamma) - (A_n(f_j) - A_n(f^*)) \big) > t \Big) \\ &\le \sum_{j=1}^{M} \mathbb{P}\big( Z_j > \gamma(R(f_j) - R^*) + t/2 \big) \end{aligned}$$

for all $t > u$, where $Z_j = R(f_j) - R^* - (R_n(f_j) - R_n(f^*))$ for all $j = 1, \ldots, M$ [recall that $R_n(f)$ is the empirical risk defined in (1.1)].

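Let $j \in \{1, \ldots, M\}$. We can write $Z_j = (1/n) \sum_{i=1}^{n} (\mathbb{E}[\zeta_{i,j}] - \zeta_{i,j})$, where $\zeta_{i,j} = 1_{(Y_i f_j(X_i) \le 0)} - 1_{(Y_i f^*(X_i) \le 0)}$. We have $|\zeta_{i,j}| \le 1$ and, under the margin assumption, we have $V(\zeta_{i,j}) \le \mathbb{E}(\zeta_{i,j}^2) = \mathbb{E}[|f_j(X) - f^*(X)|] \le c(R(f_j) - R^*)^{1/\kappa}$, where $V$ is the symbol of the variance. By applying Bernstein's inequality and Lemma 4.1 respectively, we get

$$\begin{aligned} \mathbb{P}[Z_j > \epsilon] &\le \exp\Big( - \frac{n\epsilon^2}{2c(R(f_j) - R^*)^{1/\kappa} + 2\epsilon/3} \Big) \\ &\le \exp\Big( - \frac{n\epsilon^2}{4c(R(f_j) - R^*)^{1/\kappa}} \Big) + \exp\Big( - \frac{3n\epsilon}{4} \Big) \end{aligned}$$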

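for all $\epsilon > 0$. Denote $u_j = u/2 + \gamma(R(f_j) - R^*)$. After a standard calculation we get

$$\int_u^{+\infty} \mathbb{P}\big( Z_j > \gamma(R(f_j) - R^*) + t/2 \big)\, dt = 2 \int_{u_j}^{+\infty} \mathbb{P}(Z_j > \epsilon)\, d\epsilon \le B_1 + B_2,$$

where

$$B_1 = \frac{4c(R(f_j) - R^*)^{1/\kappa}}{n u_j} \exp\Big( - \frac{n u_j^2}{4c(R(f_j) - R^*)^{1/\kappa}} \Big)$$

and

$$B_2 = \frac{8}{3n} \exp\Big( - \frac{3 n u_j}{4} \Big).$$

Since $R(f_j) \ge R^*$, Lemma 4.1 yields $u_j \ge \gamma (R(f_j) - R^*)^{1/(2\kappa)} (\log M)^{(2\kappa-1)/\kappa}\, n^{-1/2}$. For any $a > 0$, the mapping $x \mapsto (ax)^{-1} \exp(-ax^2)$ is decreasing on $(0, +\infty)$; thus we have

$$B_1 \le \frac{4c}{\gamma \sqrt{n}} (\log M)^{-(2\kappa-1)/\kappa} \exp\Big( - \frac{\gamma^2}{4c} (\log M)^{(4\kappa-2)/\kappa} \Big).$$

The mapping $x \mapsto (2/a) \exp(-ax)$ is decreasing on $(0, +\infty)$ for any $a > 0$ and $u_j \ge \gamma (\log M)^2 n^{-\kappa/(2\kappa-1)}$; thus

$$B_2 \le \frac{8}{3n} \exp\Big( - \frac{3\gamma}{4}\, n^{(\kappa-1)/(2\kappa-1)} (\log M)^2 \Big).$$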

Since $\gamma = (\log M)^{-1/4}$, we have $\mathbb{E}(W_n) \le 4 n^{-\kappa/(2\kappa-1)} (\log M)^{7/4} + T_1 + T_2$, where

$$T_1 = \frac{4Mc}{\sqrt{n}} (\log M)^{-(7\kappa-4)/(4\kappa)} \exp\Big( - \frac{1}{4c} (\log M)^{(7\kappa-4)/(2\kappa)} \Big)$$

and

$$T_2 = \frac{8M}{3n} \exp\big( -(3/4)\, n^{(\kappa-1)/(2\kappa-1)} (\log M)^{7/4} \big).$$

We have $T_2 \le 6 (\log M)^{7/4}/n$ for any integer $M > 1$. Moreover, $\kappa/(2\kappa - 1) \le 1$ for all $1 \le \kappa < +\infty$, so we get $T_2 \le 6 n^{-\kappa/(2\kappa-1)} (\log M)^{7/4}$ for any integers $n \ge 1$ and $M \ge 2$.

Let $B$ be a positive number. The inequality $T_1 \le B n^{-\kappa/(2\kappa-1)} (\log M)^{7/4}$ is equivalent to

$$2(2\kappa - 1) \Big[ \frac{1}{4c} (\log M)^{(7\kappa-4)/(2\kappa)} - \log M + \frac{7\kappa - 2}{2\kappa} \log(\log M) \Big] \ge \log\big( (4c/B)^{2(2\kappa-1)}\, n \big).$$

Since we have $\frac{7\kappa-4}{2\kappa} \ge \frac{3}{2} > 1$ for all $1 \le \kappa < +\infty$ and $M \ge a n^b$ for some positive numbers $a$ and $b$, there exists a constant $B$ which depends only on $a$, $b$ and $c$ [for instance, $B = 4c a^{-1/(2b)}$ when $n$ satisfies $\log(a n^b) \ge (b^2 (8c/6)^2) \vee ((8c/3) \vee 1)^2$] such that $T_1 \le B n^{-\kappa/(2\kappa-1)} (\log M)^{7/4}$. □

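Proof of Theorem 2.3. Let $\gamma = (\log M)^{-1/4}$. Using (4.1), we have, for any $j_0 = 1, \ldots, M$,

$$\begin{aligned} &\mathbb{E}[(A(\tilde f_n) - A^*)(1 - \gamma)] - (A(f_{j_0}) - A^*) \\ &\quad = \mathbb{E}[(A(\tilde f_n) - A^*)(1 - \gamma) - (A_n(\tilde f_n) - A_n(f^*))] + \mathbb{E}[A_n(\tilde f_n) - A_n(f_{j_0})] \\ &\quad \le \mathbb{E}[(A(\tilde f_n) - A^*)(1 - \gamma) - (A_n(\tilde f_n) - A_n(f^*))] + \frac{\log M}{n}. \end{aligned}$$

For $W_n$ defined at the beginning of the proof of Proposition 2.2 and $f^*$ the Bayes rule, we have

$$(1 - \gamma)(\mathbb{E}[A(\tilde f_n)] - A^*) \le \min_{j=1,\ldots,M} (A(f_j) - A^*) + \mathbb{E}[W_n] + \frac{\log M}{n}. \tag{4.2}$$

According to Proposition 2.2, $\mathbb{E}[W_n] \le C_0 n^{-\kappa/(2\kappa-1)} (\log M)^{7/4}$, where $C_0 > 0$ is given in Proposition 2.2. Using (4.2) and $(1 - \gamma)^{-1} \le 1 + 2\gamma$ for any $0 < \gamma < 1/2$, we get

$$\mathbb{E}[A(\tilde f_n) - A^*] \le \Big( 1 + \frac{2}{\log^{1/4}(M)} \Big) \Big\{ \min_{j=1,\ldots,M} (A(f_j) - A^*) + C \frac{\log^{7/4}(M)}{n^{\kappa/(2\kappa-1)}} \Big\}.$$

We complete the proof by using inequality (1.3) and equality $2(R(f) - R^*) = A(f) - A^*$, which holds for any prediction rule $f$. □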

Proof of Theorem 2.4. Since the $\tilde f_k$'s take their values in $[-1, 1]$ and $x \mapsto (1 - x)_+$ is linear on $[-1, 1]$, we obtain $A(\bar f_n) - A^* = \frac{1}{n} \sum_{k=1}^{n} (A(\tilde f_k) - A^*)$. Applying Theorem 2.3 to every $\tilde f_k$ for $k = 1, \ldots, n$, then taking the average of the $n$ oracle inequalities satisfied by the $\tilde f_k$ for $k = 1, \ldots, n$ and seeing that $(1/n) \sum_{k=1}^{n} k^{-\kappa/(2\kappa-1)} \le \gamma(n, \kappa)$, we obtain

$$\mathbb{E}[A(\bar f_n) - A^*] \le \Big( 1 + \frac{2}{\log^{1/4}(M)} \Big) \Big\{ \min_{j=1,\ldots,M} (A(f_j) - A^*) + C \gamma(n, \kappa) \log^{7/4}(M) \Big\}.$$

We complete the proof by the same argument as at the end of the previous proof. □

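Proof of Theorem 3.1. Let $\rho_{\min} \le \rho \le \rho_{\max}$ and $\kappa \ge 1$. Let $\rho_{m,j_0} = \min(\rho_{m,j} : \rho_{m,j} \ge \rho)$. Since $N(m) \ge A_0' m^{b'} \ge C l^{b'}$, where $C > 0$, using the oracle inequality, stated in Theorem 2.3, we have, for $\pi$ satisfying (MA1)($\kappa$),

$$\mathbb{E}[R(\tilde f_n) - R^* | D_m^1] \le \Big( 1 + \frac{2}{\log^{1/4} N(m)} \Big) \Big\{ 2 \min_{j=1,\ldots,N(m)} (R(\hat f_m^{j}) - R^*) + C \frac{\log^{7/4} N(m)}{l^{\kappa/(2\kappa-1)}} \Big\},$$

where $C$ is a positive number depending only on $b'$, $a$, $A_0'$ and $c$. Taking the expectation with respect to the subsample $D_m^1$, we have

$$\mathbb{E}[R(\tilde f_n) - R^*] \le \Big( 1 + \frac{2}{\log^{1/4} N(m)} \Big) \Big\{ 2\, \mathbb{E}[R(\hat f_m^{j_0}) - R^*] + C \frac{\log^{7/4} N(m)}{l^{\kappa/(2\kappa-1)}} \Big\}.$$

It follows from Tsybakov [42] that the excess risk of $\hat f_m^{j_0}$ satisfies

$$\sup_{\pi \in \mathcal{P}_{\kappa,\rho_{m,j_0}}} \mathbb{E}[R(\hat f_m^{j_0}) - R^*] \le C m^{-\kappa/(2\kappa + \rho_{m,j_0} - 1)},$$

where $C$ is a positive number depending only on $A$, $c$, $\kappa$, $\rho_{\min}$ and $\rho_{\max}$ (note that $C$ does not depend on $\rho_{m,j_0}$).

Moreover, we have $m \ge n(1 - a/\log 3 - 1/3)$, $N(m) \le A_0 m^{b} \le A_0 n^{b}$ and $l \ge an/\log n$, so that there exists a constant $C$ depending only on $a$, $A_0$, $A_0'$, $b$, $b'$, $\kappa$, $\rho_{\min}$ and $\rho_{\max}$ such that

$$\sup_{\pi \in \mathcal{P}_{\kappa,\rho_{m,j_0}}} \mathbb{E}[R(\tilde f_n) - R^*] \le C \big\{ n^{-\kappa/(2\kappa + \rho_{m,j_0} - 1)} + n^{-\kappa/(2\kappa-1)} (\log n)^{11/4} \big\}. \tag{4.3}$$

Since $\rho_{m,j_0} \le \rho + N(m)^{-1} \le \rho + (A_0')^{-1} [n(1 - a/\log 3 - 1/3)]^{-b'}$, there exists a constant $C'$ depending only on $a$, $A_0'$, $b'$, $\kappa$, $\rho_{\min}$ and $\rho_{\max}$ such that, for all integers $n$, $n^{-\kappa/(2\kappa + \rho_{m,j_0} - 1)} \le C' n^{-\kappa/(2\kappa + \rho - 1)}$. Theorem 3.1 follows directly from (4.3), seeing that $\rho \ge \rho_{\min} > 0$ and $\mathcal{P}_{\kappa,\rho} \subseteq \mathcal{P}_{\kappa,\rho_{m,j_0}}$ since $\rho_{m,j_0} \ge \rho$. □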

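Proof of Theorem 3.2. Define $0 < \alpha_{\min} < \alpha_{\max} < +\infty$ and $0 < \beta_{\min} < 1$ such that $K \subset [\alpha_{\min}, \alpha_{\max}] \times [\beta_{\min}, 1]$. Let $(\alpha_0, \beta_0) \in K$. We consider the function $\phi$ on $(0, +\infty) \times (0, 1]$ with values in $(1/2, 2)$, $\phi(\alpha, \beta) = 4(\alpha + 1)/((2\alpha + p\alpha + 4)(1 + \beta))$. We take $k_0 \in \{0, \ldots, \lfloor 3\Delta/2 \rfloor - 1\}$ such that

$$\phi_{l,k_0} = 1/2 + k_0 \Delta^{-1} \le \phi(\alpha_0, \beta_0) < 1/2 + (k_0 + 1) \Delta^{-1}.$$

For $n$ greater than a constant depending only on $K$, $p$, $b_0$ and $a$, there exists $\bar\alpha_0 \in [\alpha_{\min}/2, \alpha_{\max}]$ such that $\phi(\bar\alpha_0, \beta_0) = \phi_{l,k_0}$. Since $\alpha \mapsto \phi(\alpha, \beta_0)$ increases on $\mathbb{R}_+$, we have $\bar\alpha_0 \le \alpha_0$. Moreover, we have $|\phi(\alpha_1, \beta_0) - \phi(\alpha_2, \beta_0)| \ge A |\alpha_1 - \alpha_2|$, $\forall \alpha_1, \alpha_2 \in [\alpha_{\min}/2, \alpha_{\max}]$, where $A > 0$ depends only on $p$ and $\alpha_{\max}$. Thus, $|\bar\alpha_0 - \alpha_0| \le (A\Delta)^{-1}$. Since $\bar\alpha_0 \le \alpha_0$, we have $\mathcal{Q}_{\alpha_0,\beta_0} \subseteq \mathcal{Q}_{\bar\alpha_0,\beta_0}$, so

$$\sup_{\pi \in \mathcal{Q}_{\alpha_0,\beta_0}} \mathbb{E}[R(\tilde f_n) - R^*] \le \sup_{\pi \in \mathcal{Q}_{\bar\alpha_0,\beta_0}} \mathbb{E}[R(\tilde f_n) - R^*].$$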

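Since $\lfloor 3\Delta/2 \rfloor \ge (3/2) l^{b_0}$, for $\pi$ satisfying the margin Assumption (MA2)($\bar\alpha_0$), Theorem 2.3 leads to

$$\mathbb{E}[R(\tilde f_n) - R^* | D_m^1] \le \Big( 1 + \frac{2}{\log^{1/4}(\lfloor 3\Delta/2 \rfloor)} \Big) \Big\{ 2 \min_{\lambda \in \mathcal{G}(l)} (R(\hat f_m^{\lambda}) - R^*) + C_0 \frac{\log^{7/4}(\lfloor 3\Delta/2 \rfloor)}{l^{(\bar\alpha_0+1)/(\bar\alpha_0+2)}} \Big\},$$

for all integers $n \ge 1$, where $C_0 > 0$ depends only on $K$, $a$ and $b_0$. Therefore, taking the expectation w.r.t. the subsample $D_m^1$, we get

$$\mathbb{E}[R(\tilde f_n) - R^*] \le C_1 \big( \mathbb{E}[R(\hat f_m^{\lambda_{l,k_0}}) - R^*] + l^{-(\bar\alpha_0+1)/(\bar\alpha_0+2)} \log^{7/4}(n) \big),$$

where $\lambda_{l,k_0} = l^{-\phi_{l,k_0}}$ and $C_1 > 0$ depends only on $K$, $a$ and $b_0$.

Set $\Gamma : (0, +\infty) \times (0, 1] \mapsto \mathbb{R}_+$ defined by $\Gamma(\alpha, \beta) = \beta \phi(\alpha, \beta)$, $\forall (\alpha, \beta) \in (0, +\infty) \times (0, 1]$. According to Steinwart and Scovel [38], if $\pi \in \mathcal{Q}_{\bar\alpha_0,\beta_0}$, then for all $\epsilon > 0$, there exists $C > 0$, a constant depending only on $K$, $p$ and $\epsilon$, such that

$$\mathbb{E}[R(\hat f_m^{\lambda_{l,k_0}}) - R^*] \le C m^{-\Gamma(\bar\alpha_0,\beta_0) + \epsilon}.$$

We remark that $C$ does not depend on $\bar\alpha_0$ and $\beta_0$ since $(\bar\alpha_0, \beta_0) \in [\alpha_{\min}/2, \alpha_{\max}] \times [\beta_{\min}, 1]$ and that the constant multiplying the rate of convergence, stated in Steinwart and Scovel [38], is uniformly bounded over $(\alpha, \beta)$ belonging to a compact subset of $(0, +\infty) \times (0, 1]$.

Let $\epsilon > 0$. Assume that $\pi \in \mathcal{Q}_{\alpha_0,\beta_0}$. We have $n(1 - a/\log 3 - 1/3) \le m \le n$, $l \ge an/\log n$ and $\Gamma(\bar\alpha_0, \beta_0) \le (\bar\alpha_0 + 1)/(\bar\alpha_0 + 2) \le 1$. Therefore, there exist $C_2, C_2' > 0$ depending only on $a$, $b_0$, $K$, $p$ and $\epsilon$ such that, for any $n$ greater than a constant depending only on $\beta_{\min}$, $a$ and $b_0$,

$$\mathbb{E}[R(\tilde f_n) - R^*] \le C_2 \big( n^{-\Gamma(\bar\alpha_0,\beta_0) + \epsilon} + n^{-(\bar\alpha_0+1)/(\bar\alpha_0+2)} (\log n)^{11/4} \big) \le C_2' n^{-\Gamma(\bar\alpha_0,\beta_0) + \epsilon}.$$

Moreover, $\Gamma$ satisfies $|\Gamma(\bar\alpha_0, \beta_0) - \Gamma(\alpha_0, \beta_0)| \le B \Delta^{-1}$, where $B$ depends only on $p$ and $\alpha_{\min}$, and $(n^{B\Delta^{-1}})_{n \in \mathbb{N}}$ is upper bounded. This completes the proof. □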



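Proof of Theorem 3.3. Let $(\alpha_0, \gamma_0) \in K \cup K'$. First assume that $(\alpha_0, \gamma_0)$ belongs to $K \subset \mathcal{U}$. We consider the set

$$\mathcal{S} = \{ (\varphi, \psi) \in (0, 1/2) \times (1/2, 1) : 2 - 2\psi - \varphi > 0 \}.$$

Each point of $\mathcal{S}$ is associated with a margin parameter (3.2) and with a geometric noise exponent by the following functions on $\mathcal{S}$ with values in $(0, +\infty)$:

$$\bar\alpha(\varphi, \psi) = \frac{4\psi - 2}{2 - 2\psi - \varphi} \quad \text{and} \quad \bar\gamma(\varphi, \psi) = \frac{\psi}{\varphi} - 1.$$

We take $(\varphi, \psi) \in \mathcal{S} \cap \mathcal{M}(l)$ such that $\bar\alpha(\varphi, \psi) \le \alpha_0$, $\bar\gamma(\varphi, \psi) \le \gamma_0$, $\bar\alpha(\varphi, \psi)$ is close enough to $\alpha_0$, $\bar\gamma(\varphi, \psi)$ is close enough to $\gamma_0$ and $\bar\gamma(\varphi, \psi) > (\bar\alpha(\varphi, \psi) + 2)/(2\bar\alpha(\varphi, \psi))$. Since $\gamma_0 > (\alpha_0 + 2)/(2\alpha_0)$, there exists a solution $(\varphi_0, \psi_0) \in \mathcal{S}$ of the system of equations

$$\begin{cases} \bar\alpha(\varphi, \psi) = \alpha_0, \\ \bar\gamma(\varphi, \psi) = \gamma_0. \end{cases} \tag{4.4}$$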
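The two maps and the system (4.4) can be explored numerically. In the Python sketch below, the closed-form solution in `solve_44` is not taken from the paper: it is obtained here by solving the two equations by hand (the second gives $\psi = (\gamma_0 + 1)\varphi$, which is then substituted into the first). The function names and the sample point $(\alpha_0, \gamma_0) = (2, 1.5)$ are assumptions of this example.

```python
def alpha_bar(phi, psi):
    """Margin parameter associated with (phi, psi) in S."""
    return (4 * psi - 2) / (2 - 2 * psi - phi)

def gamma_bar(phi, psi):
    """Geometric noise exponent associated with (phi, psi) in S."""
    return psi / phi - 1

def solve_44(alpha0, gamma0):
    """Hand-derived solution (phi0, psi0) of alpha_bar = alpha0, gamma_bar = gamma0."""
    denom = 2 * gamma0 * (alpha0 + 2) + 3 * alpha0 + 4
    phi0 = 2 * (alpha0 + 1) / denom
    psi0 = (gamma0 + 1) * phi0
    return phi0, psi0

# Sample point with gamma0 > (alpha0 + 2) / (2 * alpha0) = 1.
phi0, psi0 = solve_44(2.0, 1.5)
print(phi0, psi0, alpha_bar(phi0, psi0), gamma_bar(phi0, psi0))
```

Under this derivation, $l^{\varphi_0/d}$ and $l^{-\psi_0}$ have the same exponents as the kernel width and regularization parameter that depend on both $\alpha_0$ and $\gamma_0$, which is why the grid is searched near $(\varphi_0, \psi_0)$.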

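For all integers $n$ greater than a constant depending only on $K$, $a$ and $b_0$, there exists $(p_{1,0}, p_{2,0}) \in \{1, \ldots, 2\lfloor \Delta \rfloor\} \times \{2, \ldots, \lfloor \Delta/2 \rfloor\}$ defined by

$$\varphi_{l,p_{1,0}} = \min(\varphi_{l,p} : \varphi_{l,p} \ge \varphi_0) \quad \text{and} \quad \psi_{l,p_{2,0}} = \max(\psi_{l,p_2} : \psi_{l,p_2} \le \psi_0) - \Delta^{-1}.$$

We have $2 - 2\psi_{l,p_{2,0}} - \varphi_{l,p_{1,0}} > 0$. Therefore, $(\varphi_{l,p_{1,0}}, \psi_{l,p_{2,0}}) \in \mathcal{S} \cap \mathcal{M}(l)$. Define $\bar\alpha_0 = \bar\alpha(\varphi_{l,p_{1,0}}, \psi_{l,p_{2,0}})$ and $\bar\gamma_0 = \bar\gamma(\varphi_{l,p_{1,0}}, \psi_{l,p_{2,0}})$. Since $(\varphi_0, \psi_0)$ satisfies (4.4), we have

$$\psi_{l,p_{2,0}} + \frac{1}{\Delta} \le \psi_0 = \frac{-\alpha_0}{2\alpha_0 + 4} \varphi_0 + \frac{1 + \alpha_0}{2 + \alpha_0} \le \frac{-\alpha_0}{2\alpha_0 + 4} \Big( \varphi_{l,p_{1,0}} - \frac{1}{2\Delta} \Big) + \frac{1 + \alpha_0}{2 + \alpha_0}$$

and $(\alpha_0/(2\alpha_0 + 4))(2\Delta)^{-1} \le \Delta^{-1}$; thus

$$\psi_{l,p_{2,0}} \le -\frac{\alpha_0}{2\alpha_0 + 4} \varphi_{l,p_{1,0}} + \frac{1 + \alpha_0}{2 + \alpha_0}, \quad \text{so } \bar\alpha_0 \le \alpha_0.$$

With a similar argument, we have $\psi_{l,p_{2,0}} \le (\gamma_0 + 1) \varphi_{l,p_{1,0}}$, that is, $\bar\gamma_0 \le \gamma_0$.

Now we show that $\bar\gamma_0 > (\bar\alpha_0 + 2)/(2\bar\alpha_0)$. Since $(\alpha_0, \gamma_0)$ belongs to a compact, $(\varphi_0, \psi_0)$ and $(\varphi_{l,p_{1,0}}, \psi_{l,p_{2,0}})$ belong to a compact subset of $(0, 1/2) \times (1/2, 1)$ for $n$ greater than a constant depending only on $K$, $a$, $b_0$. Thus, there exists $A > 0$, depending only on $K$, such that, for $n$ large enough, we have

$$|\alpha_0 - \bar\alpha_0| \le A\Delta^{-1} \quad \text{and} \quad |\gamma_0 - \bar\gamma_0| \le A\Delta^{-1}.$$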

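Denote $d_K = d(\partial\mathcal{U}, K)$, where $\partial\mathcal{U}$ is the boundary of $\mathcal{U}$ and $d(A, B)$ denotes the Euclidean distance between sets $A$ and $B$. We have $d_K > 0$ since $K$ is a compact, $\partial\mathcal{U}$ is closed and $K \cap \partial\mathcal{U} = \emptyset$. Set $0 < \alpha_{\min} < \alpha_{\max} < +\infty$ and $0 < \gamma_{\min} < \gamma_{\max} < +\infty$ such that $K \subset [\alpha_{\min}, \alpha_{\max}] \times [\gamma_{\min}, \gamma_{\max}]$. Define $\mathcal{U}_\mu = \{(\alpha, \gamma) \in (0, +\infty)^2 : \alpha \ge 2\mu \text{ and } \gamma > (\alpha - \mu + 2)/(2(\alpha - \mu))\}$ for $\mu = \min(\alpha_{\min}/2, d_K)$. We have $K \subset \mathcal{U}_\mu$, so $\gamma_0 > (\alpha_0 - \mu + 2)/(2(\alpha_0 - \mu))$. Since $\alpha \mapsto (\alpha + 2)/(2\alpha)$ is decreasing, $\bar\gamma_0 > \gamma_0 - A\Delta^{-1}$ and $\alpha_0 \le \bar\alpha_0 + A\Delta^{-1}$, we have $\bar\gamma_0 > \beta(\bar\alpha_0) - A\Delta^{-1}$, where $\beta$ is a positive function on $(0, 2\alpha_{\max}]$ defined by $\beta(\alpha) = (\alpha - (\mu - A\Delta^{-1}) + 2)/(2(\alpha - (\mu - A\Delta^{-1})))$. We have $|\beta(\alpha_1) - \beta(\alpha_2)| \ge (2\alpha_{\max})^{-2} |\alpha_1 - \alpha_2|$ for all $\alpha_1, \alpha_2 \in (0, 2\alpha_{\max}]$. Therefore, $\beta(\bar\alpha_0) - A\Delta^{-1} \ge \beta(\bar\alpha_0 + 4A\alpha_{\max}^2 \Delta^{-1})$. Thus, for $n$ greater than a constant depending only on $K$, $a$ and $b_0$, we have $\bar\gamma_0 > (\bar\alpha_0 + 2)/(2\bar\alpha_0)$.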

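Since $\bar\alpha_0 \le \alpha_0$ and $\bar\gamma_0 \le \gamma_0$, we have $\mathcal{R}_{\alpha_0,\gamma_0} \subseteq \mathcal{R}_{\bar\alpha_0,\bar\gamma_0}$ and

$$\sup_{\pi \in \mathcal{R}_{\alpha_0,\gamma_0}} \mathbb{E}[R(\tilde f_n) - R^*] \le \sup_{\pi \in \mathcal{R}_{\bar\alpha_0,\bar\gamma_0}} \mathbb{E}[R(\tilde f_n) - R^*].$$

If $\pi$ satisfies (MA2)($\bar\alpha_0$), then we get from Theorem 2.3

$$\mathbb{E}[R(\tilde f_n) - R^* | D_m^1] \le \Big( 1 + \frac{2}{\log^{1/4} M(l)} \Big) \Big\{ 2 \min_{(\sigma,\lambda) \in \mathcal{N}(l)} (R(\hat f_m^{(\sigma,\lambda)}) - R^*) + C_2 \frac{\log^{7/4}(M(l))}{l^{(\bar\alpha_0+1)/(\bar\alpha_0+2)}} \Big\}, \tag{4.5}$$

for all integers $n \ge 1$, where $C_2 > 0$ depends only on $K$, $a$ and $b_0$ and $M(l)$ is the cardinality of $\mathcal{N}(l)$. We remark that $M(l) \ge l^{2b_0}/2$, so we can apply Theorem 2.3.

Let $\epsilon > 0$. Since $M(l) \le n^{2b_0}$ and $\bar\gamma_0 > (\bar\alpha_0 + 2)/(2\bar\alpha_0)$, taking expectations in (4.5) and using the result (3.6) of Steinwart and Scovel [39], for $\sigma = \sigma_{l,\varphi_{l,p_{1,0}}}$ and $\lambda = \lambda_{l,\psi_{l,p_{2,0}}}$, we obtain

$$\sup_{\pi \in \mathcal{R}_{\bar\alpha_0,\bar\gamma_0}} \mathbb{E}[R(\tilde f_n) - R^*] \le C \big( m^{-\Theta(\bar\alpha_0,\bar\gamma_0) + \epsilon} + l^{-(\bar\alpha_0+1)/(\bar\alpha_0+2)} \log^{7/4}(n) \big),$$

where $\Theta : \mathcal{U} \mapsto \mathbb{R}$ is defined for all $(\alpha, \gamma) \in \mathcal{U}$ by $\Theta(\alpha, \gamma) = 2\gamma(\alpha + 1)/(2\gamma(\alpha + 2) + 3\alpha + 4)$ and $C > 0$ depends only on $a$, $b_0$, $K$ and $\epsilon$. We remark that the constant before the rate of convergence in (3.6) is uniformly bounded on every compact of $\mathcal{U}$. We have $\Theta(\bar\alpha_0, \bar\gamma_0) \le \Theta(\alpha_0, \gamma_0) \le \Theta(\bar\alpha_0, \bar\gamma_0) + 2A\Delta^{-1}$, $m \ge n(1 - a/\log 3 - 1/3)$ and $(m^{2A\Delta^{-1}})_{n \in \mathbb{N}}$ is upper bounded, so there exists $C_1 > 0$ depending only on $K$, $a$, $b_0$ such that $m^{-\Theta(\bar\alpha_0,\bar\gamma_0)} \le C_1 n^{-\Theta(\alpha_0,\gamma_0)}$, $\forall n \ge 1$.

A similar argument as at the end of the proof of Theorem 3.2 and the fact that $\Theta(\alpha, \gamma) \le (\alpha + 1)/(\alpha + 2)$ for all $(\alpha, \gamma) \in \mathcal{U}$ lead to the result of the first part of Theorem 3.3.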

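Let now $(\alpha_0, \gamma_0) \in K'$. Let $\alpha'_{\max} > 0$ be such that $\forall (\alpha, \gamma) \in K'$, $\alpha \le \alpha'_{\max}$. Take $p_{1,0} \in \{1, \ldots, 2\lfloor \Delta \rfloor\}$ such that $\varphi_{l,p_{1,0}} = \min(\varphi_{l,p} : \varphi_{l,p} \ge (2\gamma_0 + 1)^{-1} \text{ and } p \in 4\mathbb{N})$, where $4\mathbb{N}$ is the set of all integer multiples of 4. For large values of $n$, $p_{1,0}$ exists and $p_{1,0} \in 4\mathbb{N}$. Denoting $\bar\gamma_0 \in (0, +\infty)$ such that $\varphi_{l,p_{1,0}} = (2\bar\gamma_0 + 1)^{-1}$, we have $\bar\gamma_0 \le \gamma_0$; thus $\mathcal{R}_{\alpha_0,\gamma_0} \subseteq \mathcal{R}_{\alpha_0,\bar\gamma_0}$ and

$$\sup_{\pi \in \mathcal{R}_{\alpha_0,\gamma_0}} \mathbb{E}[R(\tilde f_n) - R^*] \le \sup_{\pi \in \mathcal{R}_{\alpha_0,\bar\gamma_0}} \mathbb{E}[R(\tilde f_n) - R^*].$$

If $\pi$ satisfies the margin assumption (3.2) with the margin parameter $\alpha_0$, then, using Theorem 2.3, we obtain, for any integer $n \ge 1$,

$$\mathbb{E}[R(\tilde f_n) - R^* | D_m^1] \le \Big( 1 + \frac{2}{\log^{1/4}(M(l))} \Big) \Big\{ 2 \min_{(\sigma,\lambda) \in \mathcal{N}(l)} (R(\hat f_m^{(\sigma,\lambda)}) - R^*) + C_0 \frac{\log^{7/4} M(l)}{l^{(\alpha_0+1)/(\alpha_0+2)}} \Big\}, \tag{4.6}$$

where $C_0 > 0$ appears in Proposition 2.2 and $M(l)$ is the cardinality of $\mathcal{N}(l)$.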

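Let $\epsilon > 0$ and $p_{2,0} \in \{1, \ldots, \lfloor \Delta/2 \rfloor\}$ be defined by $p_{2,0} = p_{1,0}/4$ (note that $p_{1,0} \in 4\mathbb{N}$). We have

$$\sigma_{l,\varphi_{l,p_{1,0}}} = \big( \lambda_{l,\psi_{l,p_{2,0}}} \big)^{-1/(d(\bar\gamma_0 + 1))}.$$

Since $\bar\gamma_0 \le (\alpha_0 + 2)/(2\alpha_0)$, using (3.6) of Steinwart and Scovel [39], we have, for $\sigma_0 = \sigma_{l,\varphi_{l,p_{1,0}}}$ and $\lambda_0 = \lambda_{l,\psi_{l,p_{2,0}}}$,

$$\mathbb{E}[R(\hat f_m^{(\sigma_0,\lambda_0)}) - R^*] \le C m^{-\bar\Gamma(\bar\gamma_0) + \epsilon},$$

where $\bar\Gamma : (0, +\infty) \mapsto \mathbb{R}$ is the function defined by $\bar\Gamma(\gamma) = \gamma/(2\gamma + 1)$ for all $\gamma \in (0, +\infty)$ and $C > 0$ depends only on $a$, $b_0$, $K'$ and $\epsilon$. We remark that, as in the first part of the proof, we can uniformly bound the constant before the rate of convergence in (3.6) on every compact subset of $\mathcal{U}'$. Since $M(l) \le n^{2b_0}$, taking the expectation in (4.6), we find

$$\sup_{\pi \in \mathcal{R}_{\alpha_0,\bar\gamma_0}} \mathbb{E}[R(\tilde f_n) - R^*] \le C \big( m^{-\bar\Gamma(\bar\gamma_0) + \epsilon} + l^{-(\alpha_0+1)/(\alpha_0+2)} \log^{7/4}(n) \big),$$

where $C > 0$ depends only on $a$, $b_0$, $K'$ and $\epsilon$. Moreover, $|\gamma_0 - \bar\gamma_0| \le 2(2\alpha'_{\max} + 1)^2 \Delta^{-1}$, so $|\bar\Gamma(\bar\gamma_0) - \bar\Gamma(\gamma_0)| \le 2(2\alpha'_{\max} + 1) \Delta^{-1}$. To achieve the proof, we use the same argument as for the first part of the proof. □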